download.file("https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv","wineQualityReds.csv",method="curl")
redWine <- read.csv("wineQualityReds.csv")
names(redWine)
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
str(redWine)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
summary(redWine)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
I found the range of Total Sulfur Dioxide surprisingly wide, so I chose to plot its histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
I observed that the maximum value of 289 for total sulfur dioxide is an outlier, and that the middle half of the wines have total sulfur dioxide values between 22 and 62 (the first and third quartiles). The distribution is positively skewed rather than long-tailed.
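One way to back up the outlier claim is the common 1.5 × IQR rule of thumb. The sketch below is not part of the original analysis; it only reuses the quartiles printed above, and the variable names are illustrative.

```r
# Quartiles and maximum copied from the summary output above
q1 <- 22
q3 <- 62
max_value <- 289

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are flagged by the rule

upper_fence
max_value > upper_fence
```

With the full data frame loaded, `hist(redWine$total.sulfur.dioxide, breaks = 30)` would be one way to draw the histogram; the bin count is an arbitrary choice.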
Next let’s observe fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity has a close-to-normal distribution.
Let’s observe volatile acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
It looks like volatile acidity has a bimodal distribution, although the two modes are very close to each other. The middle half of the values lies between 0.39 and 0.64.
Let’s summarize citric acid.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric acid is not normal. Most of the values are between 0.0 and 0.6.
Let’s analyse residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Residual sugar appears to have a roughly normal distribution with a long right tail. The middle half of the values lies between 1.9 and 2.6.
Let’s focus on the variable chlorides.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides also have a roughly normal distribution with a long right tail. The middle half of the values lies between 0.07 and 0.09.
Let’s analyse the variable alcohol.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution of alcohol is positively skewed. The middle half of the values lies between 9.50 and 11.10.
Let’s observe the density variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density is very close to normally distributed. The middle half of the values lies between 0.9956 and 0.9978.
Let’s analyse the sulphates variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates have a mostly normal distribution that is slightly positively skewed, with a few outliers. The middle half of the values lies between 0.55 and 0.73.
Analysing the pH variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH has a close-to-normal distribution. The middle half of the values lies between 3.21 and 3.40. Every wine in this dataset is acidic: the maximum pH is 4.01, well below the neutral value of 7.
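The skew statements above can be roughly cross-checked from the printed quartiles alone. The sketch below uses Bowley's quartile skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1), which is a measure added here for illustration and not something the original analysis computed; positive values suggest a right skew. The helper name `bowley_skew` is hypothetical.

```r
# Quartile-based (Bowley) skewness: 0 for a symmetric middle half,
# positive when the upper quartile is further from the median.
bowley_skew <- function(q1, med, q3) {
  (q3 + q1 - 2 * med) / (q3 - q1)
}

# Quartiles copied from the summaries above
quartiles <- data.frame(
  variable = c("total.sulfur.dioxide", "alcohol", "density", "pH"),
  q1  = c(22,  9.50, 0.9956, 3.21),
  med = c(38, 10.20, 0.9968, 3.31),
  q3  = c(62, 11.10, 0.9978, 3.40)
)

quartiles$skew <- bowley_skew(quartiles$q1, quartiles$med, quartiles$q3)
print(quartiles[, c("variable", "skew")])
```

Only the middle half of the data enters this measure, so long tails and outliers such as the maximum of 289 do not affect it.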
My dataset is related to the quality of red wine. Each observation records the chemical characteristics of one wine. There are 1599 observations of 13 variables. The variable X is only a row index, and the variable ‘quality’ is the rating given after tasting the wine. The other 11 variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol, all of which are continuous numeric variables. The quality variable is an integer and can be treated as a factor with 11 possible levels (0 to 10), of which 6 levels (3 to 8) are present in the dataset.
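A minimal sketch of treating quality as an ordered factor with all 11 possible levels. To stay self-contained it uses only the ten quality values shown in the str() output above, so it finds fewer distinct levels than the full dataset would.

```r
# First ten quality ratings, copied from the str() output above
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Declare the full 0-10 rating scale, even for ratings never observed
quality_f <- factor(quality, levels = 0:10, ordered = TRUE)

nlevels(quality_f)               # size of the declared rating scale
nlevels(droplevels(quality_f))   # number of ratings actually present
table(droplevels(quality_f))     # counts per observed rating
```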
The main features of interest in this dataset are quality and the variables that play a significant role in determining it. Some of them are alcohol, volatile.acidity, total.sulfur.dioxide and chlorides.
The color, flavor and aroma of a wine can significantly affect the perception of its quality. Having those features in this dataset would therefore have supported a better analysis and a better ‘quality’ prediction model.
Yes, I calculated total acidity from two variables in the dataset, fixed.acidity and volatile.acidity. I used it in one of the plots to observe its relationship with pH. I didn’t store this variable in the dataset, though.
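Total acidity as described here can be sketched as a simple column sum. The ten rows below are copied from the str() output above; the data frame name `wine_head` and the vector `total_acidity` are illustrative names, since the original variable was never stored.

```r
# First ten rows of the relevant columns, copied from the str() output
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Total acidity = fixed + volatile acidity, kept outside the data frame
total_acidity <- wine_head$fixed.acidity + wine_head$volatile.acidity

# Correlation with pH on this small excerpt only
cor(total_acidity, wine_head$pH)
```

On the full data, the same two lines would use redWine in place of wine_head.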
I found that citric.acid has an unusual distribution: contrary to my expectation, it is not close to a normal distribution in any way. I didn’t perform any transformation on any variable.
## X fixed.acidity volatile.acidity
## X 1.000000000 -0.26848392 -0.008815099
## fixed.acidity -0.268483920 1.00000000 -0.256130895
## volatile.acidity -0.008815099 -0.25613089 1.000000000
## citric.acid -0.153551355 0.67170343 -0.552495685
## residual.sugar -0.031260835 0.11477672 0.001917882
## chlorides -0.119868519 0.09370519 0.061297772
## free.sulfur.dioxide 0.090479643 -0.15379419 -0.010503827
## total.sulfur.dioxide -0.117849669 -0.11318144 0.076470005
## density -0.368372087 0.66804729 0.022026232
## pH 0.136005328 -0.68297819 0.234937294
## sulphates -0.125306999 0.18300566 -0.260986685
## alcohol 0.245122841 -0.06166827 -0.202288027
## quality 0.066452608 0.12405165 -0.390557780
## citric.acid residual.sugar chlorides
## X -0.15355136 -0.031260835 -0.119868519
## fixed.acidity 0.67170343 0.114776724 0.093705186
## volatile.acidity -0.55249568 0.001917882 0.061297772
## citric.acid 1.00000000 0.143577162 0.203822914
## residual.sugar 0.14357716 1.000000000 0.055609535
## chlorides 0.20382291 0.055609535 1.000000000
## free.sulfur.dioxide -0.06097813 0.187048995 0.005562147
## total.sulfur.dioxide 0.03553302 0.203027882 0.047400468
## density 0.36494718 0.355283371 0.200632327
## pH -0.54190414 -0.085652422 -0.265026131
## sulphates 0.31277004 0.005527121 0.371260481
## alcohol 0.10990325 0.042075437 -0.221140545
## quality 0.22637251 0.013731637 -0.128906560
## free.sulfur.dioxide total.sulfur.dioxide density
## X 0.090479643 -0.11784967 -0.36837209
## fixed.acidity -0.153794193 -0.11318144 0.66804729
## volatile.acidity -0.010503827 0.07647000 0.02202623
## citric.acid -0.060978129 0.03553302 0.36494718
## residual.sugar 0.187048995 0.20302788 0.35528337
## chlorides 0.005562147 0.04740047 0.20063233
## free.sulfur.dioxide 1.000000000 0.66766645 -0.02194583
## total.sulfur.dioxide 0.667666450 1.00000000 0.07126948
## density -0.021945831 0.07126948 1.00000000
## pH 0.070377499 -0.06649456 -0.34169933
## sulphates 0.051657572 0.04294684 0.14850641
## alcohol -0.069408354 -0.20565394 -0.49617977
## quality -0.050656057 -0.18510029 -0.17491923
## pH sulphates alcohol quality
## X 0.13600533 -0.125306999 0.24512284 0.06645261
## fixed.acidity -0.68297819 0.183005664 -0.06166827 0.12405165
## volatile.acidity 0.23493729 -0.260986685 -0.20228803 -0.39055778
## citric.acid -0.54190414 0.312770044 0.10990325 0.22637251
## residual.sugar -0.08565242 0.005527121 0.04207544 0.01373164
## chlorides -0.26502613 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.07037750 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide -0.06649456 0.042946836 -0.20565394 -0.18510029
## density -0.34169933 0.148506412 -0.49617977 -0.17491923
## pH 1.00000000 -0.196647602 0.20563251 -0.05773139
## sulphates -0.19664760 1.000000000 0.09359475 0.25139708
## alcohol 0.20563251 0.093594750 1.00000000 0.47616632
## quality -0.05773139 0.251397079 0.47616632 1.00000000
This table of correlation between variables gives us some idea about the pairs of variables which have significant correlation indicating a possible linear relationship.
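To read the table more easily, the quality column can be sorted by absolute correlation. The sketch below copies that column from the output above (as printed, leaving out the row index X); with the data loaded, `cor(redWine)` would be one way to produce the matrix itself, though the exact call used for the report is an assumption.

```r
# Correlation of each variable with quality, copied from the table above
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Sort by strength of the relationship, ignoring the sign
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
round(ranked, 2)
```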
The presence of sulfur dioxide in wine can be detected by smell when it is present in excess. I want to check whether there is a relationship between total sulfur dioxide and the quality of wine.
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
As evident in the graph, the wines with higher total sulfur dioxide have received a rating of 5.
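The t statistic in the output above follows directly from the correlation and the degrees of freedom, t = r * sqrt(df) / sqrt(1 - r^2). A small check using the printed values (variable names are illustrative):

```r
r  <- -0.1851003   # sample correlation from the output above
df <- 1597         # n - 2, with n = 1599 observations

t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

t_stat
p_value
```

With the data loaded, a call such as `cor.test(redWine$total.sulfur.dioxide, redWine$quality)` runs this test; that the report used exactly this call is an assumption.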
Let’s check the relationship between fixed acidity and quality.
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
Wines with higher fixed acidity seem to have received slightly higher quality ratings, although the correlation (0.12) is weak.
Next I analyse the relationship between volatile acidity and quality.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -16.9542, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
Based on the graph and correlation coefficient, it seems that increasing volatile acidity reduces the perceived quality of the wine.
Let’s see if chlorides have any significant relationship with the quality of wine.
##
## Pearson's product-moment correlation
##
## data: chlorides and quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
It looks like the perceived quality of wine is low for very high values of chlorides.
Let’s check how the quantity of sulphates affects the quality of wine.
##
## Pearson's product-moment correlation
##
## data: sulphates and quality
## t = 10.3798, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
In general, wines with a higher amount of sulphates received higher quality ratings. But we must also note that a large number of samples with extremely high values of sulphates received a quality rating of 5.
The description of variables in the txt file accompanying the data mentions that citric acid is added for freshness. I am curious to explore its relationship with the quality of wine.
##
## Pearson's product-moment correlation
##
## data: citric.acid and quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
There seems to be a weak positive relationship between citric acid and quality: higher values of citric acid tend to go with higher quality ratings (correlation 0.23).
And finally, let’s see whether the amount of alcohol in the wine affects its perceived quality.
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 21.6395, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There is a significant positive correlation between the amount of alcohol and the perceived quality of wine.
##
## Pearson's product-moment correlation
##
## data: density and quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
As density increases, the perceived quality of wine tends to decrease (correlation -0.17). This can be understood from the fact that alcohol, which is negatively correlated with density, is also the variable most strongly correlated with perceived quality, based on the graphs and correlations above.
I looked at the relationship between quality and each of the other variables in the dataset, drawing scatter plots and calculating correlation coefficients. I found that the perceived quality of wine is positively correlated with alcohol content and negatively correlated with volatile acidity, total sulfur dioxide and chlorides.
Yes, I found a strong negative correlation between pH and the total acidity of the wine, where total acidity means fixed plus volatile acidity combined. I also observed that the density of wine has a fairly strong negative correlation with its alcohol content (about -0.50).
The negative correlation between density and the percentage of alcohol is one of the stronger relationships I explored, although the correlation table shows stronger pairs, such as pH and fixed.acidity (about -0.68). I also observed that the quality of wine is significantly related to the percentage of alcohol it contains.
Having observed the relationship between quality and the other variables in the dataset, I found that the amount of alcohol, volatile acidity, chlorides and total sulfur dioxide have a great impact on the perceived quality of wine. As we can see, most of the wines with high quality ratings (blue dots) have higher alcohol and lower volatile acidity, while many of the wines with mediocre to low ratings (orange dots) have lower alcohol and higher volatile acidity. Some of the best wines (pink dots) have high alcohol and a medium level of volatile acidity.
Here, it’s visible that wines with relatively low total sulfur dioxide get higher quality ratings (blue and pink dots). The wines with low alcohol again get lower quality ratings.
Most of the blue dots are concentrated in the lower left region. It’s clearly visible here that wines with low volatile acidity and low total sulfur dioxide (blue and pink dots) have higher perceived quality than the other samples.
Next, I train a linear model on the dataset to predict the perceived quality of wine.
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + chlorides,
## data = redWine)
##
## Coefficients:
## (Intercept) alcohol volatile.acidity chlorides
## 3.1574 0.3106 -1.3821 -0.3343
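As a quick illustration of what these coefficients mean, the sketch below applies them by hand to the first wine in the dataset (alcohol 9.4, volatile acidity 0.7, chlorides 0.076, taken from the str() output). The coefficients are the rounded values printed above, so the result is approximate; `predict_quality` is a hypothetical helper, not part of the original report.

```r
# Rounded coefficients copied from the lm() output above
coefs <- c(intercept = 3.1574, alcohol = 0.3106,
           volatile.acidity = -1.3821, chlorides = -0.3343)

# Linear prediction: intercept plus each coefficient times its variable
predict_quality <- function(alcohol, volatile.acidity, chlorides) {
  unname(coefs["intercept"] +
    coefs["alcohol"] * alcohol +
    coefs["volatile.acidity"] * volatile.acidity +
    coefs["chlorides"] * chlorides)
}

predict_quality(alcohol = 9.4, volatile.acidity = 0.7, chlorides = 0.076)
```

The prediction comes out at about 5.08, against an observed rating of 5 for that wine. With the fitted model object available, `predict()` on new data would do the same with unrounded coefficients.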
As the model suggests, the percentage of alcohol present in wine positively contributes to the quality of wine while volatile.acidity and chlorides contribute negatively to quality as evident by the coefficients.
## Final Plots and Summary
Here I present the main findings from analysis of this dataset.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
As we can see, the density of wine decreases as the percentage of alcohol in the wine increases.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
As the total acidity increases, pH of the wine decreases.
The major success and learning from this project was that I could understand, through scatter plots, which of the various chemical properties of wine affect its perceived quality.
I didn’t find any major difficulties while dealing with this data.
Having information about the flavor, color and type of aroma (if it can be classified) could have significantly enriched our analysis, as I believe that these properties highly influence the perceived quality of wine.